Sains Malaysiana 54(6)(2025): 1629-1639

http://doi.org/10.17576/jsm-2025-5406-17

 

SMOTE-PCADBSCAN: A Novel Approach for Addressing Class Imbalance in Water Quality Prediction

(SMOTE-PCADBSCAN: Suatu Pendekatan Novel untuk Menangani Ketidakseimbangan Kelas dalam Ramalan Kualiti Air)

 

NORASHIKIN NASARUDDIN1,2,*, NURULKAMAL MASSERAN1, WAN MOHD RAZI IDRIS3 & AHMAD ZIA UL-SAUFIE4

 

1Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia

2Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM) Kedah Branch, 08400 Merbok, Kedah, Malaysia

3Department of Earth Science and Environment, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia

4Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia

 

Received: 12 August 2024/Accepted: 13 March 2025

 

Abstract

Accurate and trustworthy water quality prediction models are essential for supporting policy decisions in environmental management. Nonetheless, imbalanced datasets are prevalent in this discipline and hinder the accurate identification of crucial ecological factors. This study proposed a novel SMOTE-PCADBSCAN model to enhance the classification of water quality data by combining three key components: (i) synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE, after which PCA reduced the dimensionality of the data. Subsequently, DBSCAN was utilised to improve the quality of the synthetic data by detecting and eliminating extraneous data points. A Malaysia-based multi-class water quality dataset was employed to evaluate the effectiveness of this model. Five classifiers were applied to four versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN): (i) decision tree, (ii) random forest (RF), (iii) gradient boosting, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset yielded high overall accuracy, the class imbalance impaired the detection of minority classes. Among the four versions, the SMOTE-DBSCAN and SMOTE-PCADBSCAN synthetic datasets achieved superior performance metrics. RF with the SMOTE-PCADBSCAN approach achieved the highest accuracy and the best F1 scores, indicating excellent water quality classification and effective handling of imbalanced data. Consequently, the classification accuracy of imbalanced environmental datasets could be enhanced by employing advanced oversampling techniques and ensemble approaches.

Keywords: DBSCAN; imbalanced data; PCA; SMOTE; water quality
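
The three-stage idea summarised in the abstract (oversample the minority class, project to fewer dimensions, discard points that a density-based clustering flags as noise) can be illustrated with a minimal NumPy-only sketch. This is not the authors' implementation: the function names, the toy data, the parameter values (`k`, `n_components`, `eps`, `min_pts`), and the choice to cluster only the pooled minority points and to drop only synthetic noise points are all assumptions made for illustration. In practice, library implementations such as `SMOTE` in imbalanced-learn and `PCA` and `DBSCAN` in scikit-learn would normally be preferred over these simplified versions.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def pca_project(X, n_components):
    """Project centred data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def dbscan(X, eps, min_pts):
    """Basic DBSCAN; returns cluster labels, with -1 marking noise."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue                               # not a core point; noise unless reached later
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster                # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])        # expand only through core points
        cluster += 1
    return labels

def smote_pcadbscan(X, y, minority, n_new, n_components=2, eps=0.5, min_pts=5, seed=0):
    """Assumed pipeline: SMOTE, then PCA, then removal of synthetic points that DBSCAN marks as noise."""
    X_min = X[y == minority]
    synth = smote(X_min, n_new, rng=seed)
    pool = np.vstack([X_min, synth])               # real + synthetic minority points
    labels = dbscan(pca_project(pool, n_components), eps, min_pts)
    keep = labels[len(X_min):] != -1               # keep synthetic points that belong to a cluster
    X_out = np.vstack([X, synth[keep]])
    y_out = np.concatenate([y, np.full(keep.sum(), minority)])
    return X_out, y_out

# Toy data (hypothetical): 200 majority and 20 minority samples with 4 features.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 4))
X_min = rng.normal(3.0, 0.5, size=(20, 4))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 200 + [1] * 20)

X_bal, y_bal = smote_pcadbscan(X, y, minority=1, n_new=180)
print("class counts before:", np.bincount(y), "after:", np.bincount(y_bal))
```

The resampled data would then be passed to any of the classifiers named in the abstract. Because DBSCAN is sensitive to `eps` and `min_pts`, and PCA is sensitive to feature scale, those settings would need tuning (and the features would typically be standardised) on real water quality data.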

 

Abstract (translated from the Malay)

An accurate and reliable prediction model is important for supporting policy decisions in environmental management related to water quality prediction. However, imbalanced datasets are common in this discipline and hinder the accurate identification of important ecological factors. This study proposes an innovative SMOTE-PCADBSCAN model to improve the classification of water quality data using three main components: (i) synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE and then underwent dimensionality reduction by PCA. Next, DBSCAN was used to produce high-quality synthetic data by detecting and removing irrelevant or redundant data points. A multi-class water quality dataset from Malaysia was used to determine the effectiveness of this model. Four different versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN) were analysed with five types of classifier: (i) decision tree, (ii) random forest, (iii) gradient boosting machine, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset showed high accuracy, class imbalance arose when detecting the minority classes. Among the datasets, the performance metrics of the synthetic datasets based on SMOTE-DBSCAN and SMOTE-PCADBSCAN were better. The highest accuracy and optimal F1 scores were also shown by RF using the SMOTE-PCADBSCAN approach, which performed excellently in water quality classification and imbalanced data management. Therefore, the classification accuracy of imbalanced environmental datasets can be improved by using advanced oversampling techniques and ensemble approaches.

Keywords: Imbalanced data; DBSCAN; water quality; PCA; SMOTE

 

 

*Corresponding author; email: norashikin116@uitm.edu.my

 

 

 

 

 

 

 

 

           
